[SPARK-54405][SQL][Metric View] CREATE command and SELECT query resolution #53158
linhongliu-db wants to merge 17 commits into apache:master
cc @cloud-fan to review
mode DOLLAR_QUOTED_STRING_MODE;

DOLLAR_QUOTED_STRING_BODY
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Measure.scala
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveMetricView.scala
val name = child match {
  case v: ResolvedIdentifier =>
    v.identifier.asTableIdentifier
  case _ => throw QueryCompilationErrors.loadDataNotSupportedForV2TablesError()
this should not happen, right? CheckAnalysis should fail earlier. We can throw internal error here.
you are right, removed
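For reference, a minimal sketch of the internal-error pattern suggested above (hedged; the surrounding command code is abridged and the exact message is illustrative):

import org.apache.spark.SparkException

val name = child match {
  case v: ResolvedIdentifier => v.identifier.asTableIdentifier
  // Unreachable if CheckAnalysis rejected the plan earlier; report as an
  // internal error rather than a user-facing compilation error.
  case other =>
    throw SparkException.internalError(
      s"Unexpected plan node when resolving metric view command: $other")
}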
- Add METRICS keyword to lexer
- Add dollar-quoted string support for YAML definitions
- Add createMetricView production rule
- Add METRIC_VIEW catalog table type
- Implement CreateMetricViewCommand:
  - Parse YAML definition
  - Build MetricViewPlaceholder logical node
  - Validate and analyze metric view
- Add MetricViewPlaceholder logical node with tree patterns
- Update ViewHelper to support metric view creation
- Add basic test suite for metric views
- Add MetricViewPlanner utility:
  - planRead() to parse metric view for SELECT queries
  - planWrite() refactored from metricViewCommands
  - parseYAML() shared parsing logic
- Add ResolveMetricView analyzer rule:
  - Transform MetricViewPlaceholder into aggregation queries
  - Parse dimensions and measures from schema metadata
  - Build Project with dimensions and Aggregate with measures
  - Handle measure references in aggregates
- Update SessionCatalog to parse metric view definitions on read
- Update EliminateView to handle ResolvedMetricView nodes
- Refactor CreateMetricViewCommand to use MetricViewPlanner
- Update ViewHelper to set METRIC_VIEW table type correctly
- Add ResolveMetricView to analyzer rule chain
- Update test suite with query tests
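For context, a hedged sketch of what end-to-end usage might look like, assuming the clause order suggested by the grammar pieces above (the view name, source table, and YAML keys are illustrative, not taken from this PR):

// Illustrative only: clause order and YAML schema are assumptions.
// `spark` is a SparkSession.
spark.sql(
  """CREATE VIEW sales_metrics
    |WITH METRICS
    |LANGUAGE YAML
    |AS $$
    |source: sales_table
    |dimensions:
    |  - name: region
    |    expr: region
    |measures:
    |  - name: total_count
    |    expr: sum(count)
    |$$
    |""".stripMargin)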
override def prettyName: String = getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("measure")

override def nullable: Boolean = true
should it be child.nullable?
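A minimal sketch of the suggested change, assuming Measure wraps a single child expression (illustrative shape, not the PR's actual class):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.DataType

// Hypothetical shape; only the nullable override is the point here.
case class Measure(child: Expression) extends UnaryExpression {
  override def dataType: DataType = child.dataType
  // Follow the wrapped expression's nullability instead of hardcoding true.
  override def nullable: Boolean = child.nullable
  // Pass-through evaluation, just to keep the sketch self-contained.
  override def eval(input: InternalRow): Any = child.eval(input)
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    child.genCode(ctx)
  override protected def withNewChildInternal(newChild: Expression): Measure =
    copy(child = newChild)
}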
sql/catalyst/src/main/scala/org/apache/spark/sql/metricview/util/MetricViewPlanner.scala
  }
}

override def visitCodeLiteral(ctx: CodeLiteralContext): String = {
shall we put this in AstBuilder?
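If this moves into AstBuilder, the override might look roughly as follows (hedged: the generated-context accessor for the body token is assumed from the lexer rules shown earlier):

// Sketch only. Assumes the codeLiteral grammar rule exposes the dollar-quoted
// body as a single DOLLAR_QUOTED_STRING_BODY token.
override def visitCodeLiteral(ctx: CodeLiteralContext): String = withOrigin(ctx) {
  ctx.DOLLAR_QUOTED_STRING_BODY().getText
}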
if (ctx.METRICS(0) == null) {
  throw QueryParsingErrors.missingClausesForOperation(
    ctx, "WITH METRICS", "CREATE METRIC VIEW")
This is a bit misleading as we don't really support the syntax CREATE METRIC VIEW, shall we just say metric view creation?
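i.e. something along these lines (sketch of the suggested wording only):

// Name the operation plainly instead of implying a "CREATE METRIC VIEW" syntax.
if (ctx.METRICS(0) == null) {
  throw QueryParsingErrors.missingClausesForOperation(
    ctx, "WITH METRICS", "metric view creation")
}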
if (ctx.routineLanguage(0) == null) {
  throw QueryParsingErrors.missingClausesForOperation(
    ctx, "LANGUAGE", "CREATE METRIC VIEW")
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
}

val languageCtx = ctx.routineLanguage(0)
withOrigin(languageCtx) {
    .getOrElse(Map.empty)
val codeLiteral = visitCodeLiteral(ctx.codeLiteral())

withIdentClause(ctx.identifierReference(), ident => {
I think it's simpler to do
CreateMetricViewCommand(
  withIdentClause(...),
  userSpecifiedColumns,
  visitCommentSpecList(ctx.commentSpec()),
  properties,
  codeLiteral,
  allowExisting = ctx.EXISTS != null,
  replace = ctx.REPLACE != null
)
 * groupingExpressions = [region],
 * aggregateExpressions = [region, sum(amount), avg(price)],
 * child = Filter(upper(region) = 'REGION_1',
 *           Filter(product = 'product_1', sales_table))
where is the aforementioned Project?
Added, and fixed the upper filter expression to use the dimension AttributeReference.
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

override val output: Seq[Attribute] = Seq(
  AttributeReference("result", StringType, nullable = false)()
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveMetricView.scala
 * 2. Load and parse the stored metric view definition from catalog metadata
 * 3. Build a [[Project]] node that:
 *    - Projects dimension expressions: [region, upper(region) AS region_upper]
 *    - Includes non-conflicting source columns for filters
but in the example, we filter by region_upper. I think the main reason is for the measure agg functions to reference columns?
Yes, that's right. I updated the comment to make it clearer.
// 3. metric view output should use the same exprId
val sourceProjList = sourceOutput.filterNot { attr =>
  // conflict with dimensions
  metricView.outputMetrics
outputMetrics contains both dimension and measure columns, shall we filter out dimension columns first before we look up the column name?
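A hedged sketch of that suggestion (the isDimension helper and the metadata value it checks are hypothetical; the real code would use whatever value the PR stores under COLUMN_TYPE_PROPERTY_KEY):

// Hypothetical helper: treat a column as a dimension based on its metadata.
def isDimension(attr: Attribute): Boolean =
  attr.metadata.contains(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY) &&
    attr.metadata.getString(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY) == "dimension"

// Only dimension columns should shadow same-named source columns.
val dimensionNames = metricView.outputMetrics.filter(isDimension).map(_.name).toSet
val sourceProjList = sourceOutput.filterNot(attr => dimensionNames.contains(attr.name))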
if (attr.metadata.contains(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY)) {
  // no alias for metric view column since the measure reference needs to use the
  // measure column in MetricViewPlaceholder, but an alias will change the exprId
  attr
then will it have issues with DeduplicateRelation?
I added a test case to verify that a union of two identical metric view queries works well.
In the test, before DeduplicateRelation, the plan is:
Union false, false
:- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
: +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
: +- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
: +- SubqueryAlias spark_catalog.default.test_table
: +- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet
+- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
+- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
+- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
+- SubqueryAlias spark_catalog.default.test_table
+- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet
After DeduplicateRelation, the plan is:
region: string, total_count: bigint, avg_price: double
Union false, false
:- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
: +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
: +- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
: +- SubqueryAlias spark_catalog.default.test_table
: +- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet
+- Aggregate [region#1920], [region#1920, sum(count#1918) AS total_count#1923L, avg(price#1919) AS avg_price#1924]
+- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
+- Project [count#1916 AS count#1918, price#1917 AS price#1919, cast(region#1914 as string) AS region#1920, cast(product#1915 as string) AS product#1921, cast(upper(region#1914) as string) AS region_upper#1922]
+- SubqueryAlias spark_catalog.default.test_table
+- Relation spark_catalog.default.test_table[region#1914,product#1915,count#1916,price#1917] parquet
then why do we need to add alias at all?
This is not actually needed in this PR. But to support GROUP BY grouping sets, there will be an Expand + Project between the Aggregate and the ResolvedMetricView node, and I found that without this alias, DeduplicateRelation fails to update the exprIds in the Expand + Project, leading to dangling references.
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveMetricView.scala
}.map { attr =>
  if (attr.metadata.contains(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY)) {
    // no alias for metric view column since the measure reference needs to use the
    // measure column in MetricViewPlaceholder, but an alias will change the exprId
MetricViewPlaceholder.output is fully decoupled from its child. To not change the exprId, we need to add an alias that retains the original exprId, and you are doing the opposite. The Project should be constructed like this:
Project(
  Seq(
    Alias(source_attr_1, name)(exprId = output_metric_1.exprId),
    ...
  ),
  source
)
Actually this is similar to how we output dimension cols, see https://github.com/apache/spark/pull/53158/files#diff-1f33f825cb8bc9e947d5f021b7e21ec37996d97b2d65375dd0c430a38f1c7c25R336
Oh, actually here we are adding new columns that are not in outputMetrics, so why does the exprId matter?
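For reference, a concrete version of the Project construction sketched in the comment above (illustrative; the name-based lookup and surrounding variables are assumptions, and case-sensitivity handling is elided):

import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.Project

// Re-alias each source attribute under the exprId already exposed by
// MetricViewPlaceholder.output, so references resolved against the
// placeholder stay valid after it is replaced.
val projectList: Seq[NamedExpression] = metricView.outputMetrics.map { out =>
  val sourceAttr = source.output.find(_.name == out.name).getOrElse(
    throw new IllegalStateException(s"Missing source column: ${out.name}"))
  Alias(sourceAttr, out.name)(exprId = out.exprId)
}
Project(projectList, source)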
thanks, merging to master!
What changes were proposed in this pull request?
This PR implements the command to create metric views and the analysis rule to resolve a metric view query:
- Support WITH METRICS when creating a view
NOTE: This PR depends on #53146
This PR also marks org.apache.spark.sql.metricview as an internal package.
Why are the changes needed?
SPIP: Metrics & semantic modeling in Spark
Does this PR introduce any user-facing change?
No
Was this patch authored or co-authored using generative AI tooling?
No